IEEE Journal of Biomedical and Health Informatics — Latest Matching Preprints

1

Multi-BOUNTI: Multi-lobe Brain vOlUmetry and segmeNtation for feTal and neonatal MRI

Uus, A.; Fukami-Gartner, A.; Kyriakopoulou, V.; Cromb, D.; Morgan, T.; Arulkumaran, S.; Egloff Collado, A.; Luis, A.; Bos, R.; Makropoulos, A.; Schuh, A.; Robinson, E.; Sousa, H.; Deprez, M.; Cordero-Grande, L.; Bradshaw, C.; Colford, K.; Hutter, J.; Price, A.; O'Muircheartaigh, J.; Hammers, A.; Rueckert, D.; Counsell, S.; McAlonan, G.; Arichi, T.; Edwards, A. D.; Hajnal, J. V.; Rutherford, M. A.; Story, L.

2026-04-22 pediatrics 10.64898/2026.04.21.26351376 medRxiv

Top 0.1%

8.3%

Regional volumetric assessment of perinatal brain development is currently limited by the lack of consistent, high-quality multi-regional segmentation methods applicable to both fetal and neonatal MRI. We present Multi-BOUNTI, a deep learning pipeline for automated multi-lobe segmentation of fetal and neonatal T2w brain MRI. The method is based on a dedicated 43-label parcellation protocol and a 3D Attention U-Net trained on brain MRI datasets of subjects spanning 21-44 weeks gestational/postmenstrual age (GA/PMA). The pipeline integrates preprocessing, segmentation and volumetric analysis, and was evaluated on independent datasets, demonstrating fast (< 10 min/case) and accurate performance with high agreement to manually refined labels. We demonstrate the application of the framework with 267 fetal and 593 neonatal MRI datasets from the developing Human Connectome Project without reported clinically significant brain anomalies to derive normative volumetric growth models across 21-44 weeks GA/PMA. These models were used to characterise developmental trajectories, assess differences between fetal and preterm neonatal cohorts, and analyse longitudinal changes. The resulting normative models were integrated into an automated reporting framework enabling subject-specific volumetric assessment via centiles and z-scores. Multi-BOUNTI provides a unified and scalable approach for perinatal brain segmentation and volumetry, supporting large-scale studies and facilitating future clinical translation. The full pipeline is publicly available at https://github.com/SVRTK/perinatal-brain-mri-analysis.
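
The subject-specific assessment via centiles and z-scores mentioned above can be illustrated with a minimal sketch. It assumes a Gaussian normative model in which the expected volume and standard deviation at a given age are already known; the function names and all numbers are hypothetical and are not taken from the Multi-BOUNTI pipeline.

```python
from statistics import NormalDist


def volume_z_score(volume_ml: float, expected_ml: float, sd_ml: float) -> float:
    """Z-score of a measured volume against a normative mean and SD (assumed Gaussian)."""
    if sd_ml <= 0:
        raise ValueError("sd_ml must be positive")
    return (volume_ml - expected_ml) / sd_ml


def z_to_centile(z: float) -> float:
    """Convert a z-score to a centile (0-100) using the standard normal CDF."""
    return 100.0 * NormalDist().cdf(z)


# Hypothetical example: a regional volume of 110 ml where the normative
# model predicts 100 ml with an SD of 5 ml at the subject's age.
z = volume_z_score(110.0, 100.0, 5.0)
print(f"z = {z:.2f}, centile = {z_to_centile(z):.1f}")
```

A real normative model would supply age-dependent means and SDs (or a non-Gaussian distribution), in which case only the lookup of `expected_ml` and `sd_ml` would change.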

2

Assessing physiological coherence in stress related predictions of large language models: a surrogate based analysis of the Mistral 3 family using wearable HRV data

Bolpagni, M.; Pozza, M.; Gabrielli, S.

2026-04-27 health informatics 10.64898/2026.04.24.26351717 medRxiv

Top 0.2%

6.9%

Chronic psychological stress contributes to allostatic load and is associated with cardiovascular, metabolic, and mental health disorders. Wearable devices enable continuous, noninvasive monitoring of autonomic signals such as heart rate variability (HRV), creating new opportunities for real-time stress assessment. Large language models (LLMs) are increasingly explored as interfaces for interpreting such data, but it remains unclear whether their predictions reflect physiologically meaningful patterns or rely on superficial heuristics. In this study, we assess whether LLM-derived stress predictions are physiologically coherent and how this varies with model scale. Using a longitudinal wearable dataset collected in naturalistic conditions (35 participants; 5,100 five-minute windows with HRV and contextual features), we obtained stress pseudoprobabilities from three models in the Mistral 3 family (675B, 14B, 3B) via zero-shot prompting. To make model behavior interpretable, we trained surrogate models to approximate LLM outputs and analyzed feature-response relationships using SHAP. Our results indicate that surrogate models closely reproduced LLM predictions (R² up to 0.915; Cohen's κ up to 0.941), enabling high-fidelity characterization of decision patterns and providing a practical framework for auditing the physiological coherence of LLM-derived predictions. Physiological coherence increased with model scale: the largest model exhibited near-complete alignment with established HRV stress responses, together with stable, predominantly monotonic feature effects and a balanced integration of physiological and contextual information. This pattern weakened at smaller scales, with the mid-scale model showing partial alignment and the smallest model displaying reduced stability, greater feature concentration, and more irregular, non-monotonic relationships.
These findings indicate that larger LLMs encode more physiologically consistent representations of stress, whereas smaller models rely on simplified and less stable strategies, and highlight the value of surrogate-based analysis as a practical framework for evaluating LLM behavior in biomedical applications and supporting their responsible integration into wearable health analytics.
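
Surrogate fidelity of the kind reported in this abstract is commonly summarized with the coefficient of determination between the LLM outputs and the surrogate's predictions. The following is a minimal sketch of that calculation only; the function name and the toy values are hypothetical, and the study's actual surrogate models and evaluation code are not described in the abstract.

```python
def r_squared(targets: list[float], predictions: list[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if not targets or len(targets) != len(predictions):
        raise ValueError("inputs must be non-empty and of equal length")
    mean_target = sum(targets) / len(targets)
    ss_res = sum((t - p) ** 2 for t, p in zip(targets, predictions))
    ss_tot = sum((t - mean_target) ** 2 for t in targets)
    if ss_tot == 0:
        raise ValueError("targets have zero variance; R^2 is undefined")
    return 1.0 - ss_res / ss_tot


# Hypothetical LLM stress pseudoprobabilities and surrogate predictions.
llm_outputs = [0.10, 0.35, 0.60, 0.85]
surrogate_outputs = [0.12, 0.30, 0.63, 0.80]
print(f"surrogate fidelity R^2 = {r_squared(llm_outputs, surrogate_outputs):.3f}")
```

A value near 1 means the surrogate explains almost all of the variance in the LLM outputs, which is the precondition for interpreting the surrogate's feature effects as a description of the LLM's behavior.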

3

A Closer-to-Brain Heterosynaptic Learning Rule for Spatiotemporal Spike Pattern Detection with Low-Resolution Synapse

Furuichi, S.; Kohno, T.

2026-04-22 neuroscience 10.64898/2026.04.19.719429 medRxiv

Top 0.4%

3.7%

The brain is believed to process information efficiently in a different manner from deep learning-based artificial intelligence (AI). Brain-like next-generation AI is gaining attention owing to its potential to perform human-like, highly adaptive, robust, and power-efficient computation. To realize such AI, one crucial approach is the bottom-up implementation of neuronal systems, capturing their electrophysiological characteristics in electronic circuits. However, this neuromorphic approach generally focuses on simplified neuronal models that do not reflect many biological findings. Developing closer-to-brain models is a natural direction that serves as a fundamental computing model for next-generation AI. One of the constraints of neuromorphic circuits is the bit resolution of synaptic efficacy memory, as the memory footprint scales with its precision. Although low-resolution synaptic efficacy is essential for minimizing memory circuit footprint and energy consumption, it generally leads to performance degradation in many tasks, such as spatiotemporal spike pattern detection. This study proposes a closer-to-brain learning rule that incorporates heterosynaptic plasticity (HP) induced by glutamate spillover. We demonstrate that our model mitigates the performance degradation associated with low-bit-resolution synaptic efficacy, achieving a pattern detection success rate with 3-bit synaptic efficacy that is comparable to that obtained with 64-bit floating-point precision. Furthermore, the findings indicate that the HP-based model accelerates the convergence of synaptic efficacy and effectively potentiates the synapses relevant to pattern detection while suppressing irrelevant ones, thereby promoting a bimodal distribution of synaptic efficacies. These findings may provide a basic framework for constructing an energy-efficient, brain-like next-generation AI that maintains high performance under hardware constraints.
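
To make the notion of 3-bit synaptic efficacy concrete, here is a minimal sketch of uniform quantization onto 2^3 = 8 levels. The uniform grid, the [0, 1] range, and the function name are assumptions for illustration; the abstract does not state how the authors discretize efficacies, and the sketch does not implement their learning rule.

```python
import math


def quantize_efficacy(w: float, bits: int = 3,
                      w_min: float = 0.0, w_max: float = 1.0) -> float:
    """Snap a synaptic efficacy to the nearest of 2**bits evenly spaced levels."""
    if bits < 1 or w_max <= w_min:
        raise ValueError("need bits >= 1 and w_max > w_min")
    levels = 2 ** bits
    clipped = min(max(w, w_min), w_max)          # keep the value inside the range
    position = (clipped - w_min) / (w_max - w_min) * (levels - 1)
    index = math.floor(position + 0.5)           # round half up to a level index
    return w_min + (w_max - w_min) * index / (levels - 1)


# With 3 bits, every efficacy in [0, 1] maps to one of 8 values: 0, 1/7, ..., 1.
for w in (0.03, 0.40, 0.52, 0.97):
    print(w, "->", round(quantize_efficacy(w), 4))
```

A bimodal efficacy distribution, as described in the abstract, is plausibly less sensitive to this kind of coarse grid, because most values already sit near the two ends of the range.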

4

RNABag: A Generalizable Transcriptome Foundation Model for Precision Oncology across Biopsy Modalities

Luo, P.; Luo, D.; Li, D.; Xue, X.; Yang, J.; Gong, X.; Tang, K.

2026-04-22 bioinformatics 10.64898/2026.04.19.719450 medRxiv

Top 0.5%

3.6%

Transcriptomic data are highly sensitive to cancer state and progression, making transcriptome-based foundation models promising for diverse clinical ontological inference. However, analyses of the transcriptome are conventionally hindered by technical batch effects and limited generalization across platforms. Here, we introduce RNABag, a foundation model designed to generalize well to external datasets. In particular, the model focuses only on highly variable genes to reduce noise, and extensive data augmentation was used during pretraining so that RNABag learns robust representations that are invariant to batch variations. We demonstrate that RNABag achieves superior performance in pan-cancer tissue-of-origin classification and cancer detection in internal validation sets, as well as in zero-shot generalization to external cohorts and in-house clinical samples. Furthermore, RNABag, after specialized finetuning, exhibits strong capabilities in a wide range of clinical applications. The model effectively stratifies patient survival and predicts relapse risks, highlighting key molecular pathways driving tumor progression. Crucially, we extend RNABag's utility to liquid biopsies, achieving high diagnostic accuracy in plasma cfRNA and tumor-educated platelets (TEPs), thereby supporting its application in non-invasive cancer monitoring. Interpretability analysis revealed a pivotal role of tumor immune escape in the cancer-induced plasma cfRNA signals. In summary, our study indicates that cancer states and progression may be monitored in detail and with precision via comprehensive modeling of the transcriptome across biopsy modalities.

5

Multimodal Integration of Ambulatory ECG and Clinical Features for Sudden Cardiac Death and Pump Failure Death Prediction

Swee, S.; Adam, I.; Zheng, E. Y.; Ji, E.; Wang, D.; Speier, W.; Hsu, J.; Chang, K.-W.; Shivkumar, K.; Ping, P.

2026-04-22 cardiovascular medicine 10.64898/2026.04.21.26351421 medRxiv

Top 0.6%

2.9%

Ambulatory electrocardiograms (ECGs) provide continuous monitoring of the heart's electrical activity. However, many existing machine learning and artificial intelligence models for analyzing ambulatory ECG traces are unimodal and do not incorporate patient clinical context. In this study, we propose a multimodal framework integrating ambulatory ECG-derived representations with clinical text embeddings to predict two cardiac outcomes: sudden cardiac death and pump failure death. Ambulatory ECG traces are preprocessed, segmented, and encoded via a multiple instance learning and temporal convolutional neural network framework. In parallel, patient clinical features are parsed into structured prompts, which are passed through a large language model to generate clinical reasoning; this reasoning passes through a biomedical language encoder to generate a text embedding. With the ECG and text embeddings, we systematically evaluate multiple fusion strategies, including concatenation- and gating-based approaches, to integrate these two data modalities. Our results demonstrate that multimodal models consistently outperform unimodal baselines, with adaptive fusion mechanisms providing the greatest improvements in predictive performance. Decision curve analysis highlights the potential clinical utility of the proposed framework for risk stratification. Finally, we visualize model attention across modalities, including ECG attention patterns, segment-level saliency, heart rate variability features, and clinical reasoning, to contextualize patient-specific predictions.
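
The two fusion families named in this abstract can be contrasted in a short sketch. It assumes the ECG and text embeddings have equal length and uses a single scalar sigmoid gate; the function names, the gate parameterization, and the toy vectors are hypothetical, since the abstract does not specify the authors' fusion layers.

```python
import math


def concat_fusion(ecg: list[float], text: list[float]) -> list[float]:
    """Concatenation fusion: stack both embeddings into one longer vector."""
    return list(ecg) + list(text)


def gated_fusion(ecg: list[float], text: list[float],
                 gate_weights: list[float], gate_bias: float = 0.0) -> list[float]:
    """Gating fusion: a sigmoid gate g, computed from both embeddings,
    mixes them as g * ecg + (1 - g) * text."""
    if len(ecg) != len(text):
        raise ValueError("embeddings must have the same length")
    joint = concat_fusion(ecg, text)
    if len(gate_weights) != len(joint):
        raise ValueError("gate_weights must match the concatenated length")
    logit = sum(w * x for w, x in zip(gate_weights, joint)) + gate_bias
    gate = 1.0 / (1.0 + math.exp(-logit))
    return [gate * e + (1.0 - gate) * t for e, t in zip(ecg, text)]


# Hypothetical 2-dimensional embeddings; in practice the gate weights are learned.
ecg_embedding = [1.0, 0.0]
text_embedding = [0.0, 1.0]
print(concat_fusion(ecg_embedding, text_embedding))
print(gated_fusion(ecg_embedding, text_embedding, [2.0, 0.0, 0.0, -1.0]))
```

Concatenation leaves the weighting of modalities to downstream layers, whereas the gate makes the per-patient balance between ECG and clinical text explicit, which is one way an "adaptive" fusion mechanism can be realized.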

6

Kernel Matrix Completion with Topological and Spectral Features for Multi-Modal Classification

Rinon, E. M.; Visaya, M. V.; Sambayan, R.

2026-04-22 bioinformatics 10.64898/2026.04.19.713528 medRxiv

Top 0.6%

2.6%

Kernel methods offer a robust framework for integrating multi-modal datasets into a unified representation, thereby facilitating more comprehensive data interpretation. In the presence of incomplete datasets, multiple kernel learning is employed to enhance the efficiency of data completion and integration. We investigate kernel-based approaches to address the incomplete-data problem with applications to yeast protein data. Biological data such as yeast proteins can be represented through multiple modalities, including gene expression profiles, amino acid sequences, three-dimensional structures, and protein interaction networks. We introduce a computational pipeline based on kernel matrix completion, in which topological data analysis (TDA) and persistent spectral analysis are incorporated into the classification setting. TDA captures geometric structure across scales, while spectral descriptors reflect connectivity patterns through Laplacian eigenvalues. Kernel, topological, and spectral descriptors are used with support vector machines to discriminate between membrane and non-membrane yeast proteins. Empirical results show that the combined pipeline improves both kernel completion accuracy and ROC performance relative to baseline kernel-only approaches. The best-performing configuration achieves an ROC score of 0.8632 using the average of three kernels augmented with TDA features. These results demonstrate competitive performance relative to strong kernel-based baselines under incomplete data conditions. The proposed method provides a unified approach for learning from incomplete heterogeneous data while enriching kernel representations with geometric and spectral information.
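
The "average of three kernels" configuration can be sketched as an element-wise mean of kernel matrices computed on the same samples. This is a minimal illustration with hypothetical 2x2 matrices; it omits the kernel completion step and the TDA features, neither of which is specified in enough detail in the abstract to reproduce.

```python
Matrix = list[list[float]]


def average_kernels(kernels: list[Matrix]) -> Matrix:
    """Element-wise mean of square kernel matrices defined over the same samples."""
    if not kernels:
        raise ValueError("at least one kernel matrix is required")
    n = len(kernels[0])
    for k in kernels:
        if len(k) != n or any(len(row) != n for row in k):
            raise ValueError("all kernels must be square with the same size")
    return [[sum(k[i][j] for k in kernels) / len(kernels) for j in range(n)]
            for i in range(n)]


# Hypothetical kernels from three modalities for the same two proteins.
k_expression = [[1.0, 0.0], [0.0, 1.0]]
k_sequence = [[1.0, 0.5], [0.5, 1.0]]
k_network = [[1.0, 1.0], [1.0, 1.0]]
print(average_kernels([k_expression, k_sequence, k_network]))
```

Because a non-negative combination of positive semidefinite matrices is itself positive semidefinite, the averaged matrix remains a valid kernel and can be passed to a support vector machine that accepts precomputed kernels.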

7

Wavelet analysis reveals non-stationary cardiovascular rhythms associated with delirium and deep sedation in ICU patients

Sreekanth, J.; Salgado-Baez, E.; Edel, A.; Gruenewald, E.; Piper, S. K.; Spies, C.; Balzer, F.; Boie, S. D.

2026-04-23 health informatics 10.64898/2026.04.22.26351455 medRxiv

Top 0.7%

2.4%

Routine ICU data offers valuable insights into daily physiological rhythms. While traditional methods assume these cycles maintain fixed periods and amplitudes, their inherent variability requires dynamic estimation of instantaneous trends. Wavelet transform effectively resolves circadian oscillations, especially for frequently measured vital parameters. We present novel extensions to the Continuous Wavelet Transform (CWT) power spectral analysis to better detect and segment subtle temporal patterns. Using this approach, we uncover hidden circadian patterns in cardiovascular vitals such as Heart Rate (HR) and Mean Blood Pressure (MBP) measured over five days in a retrospective cohort of 855 ICU patients. By quantifying non-stationary rhythms, we identified diurnal and semi-diurnal oscillations varying in period and power according to delirium and deep sedation. Notably, HR exhibits a clear diurnal and semi-diurnal rhythm when delirium is absent. Overall, our framework supports the CWT as a powerful tool for analyzing complex physiological signals, particularly vital signs. Crucially, our findings suggest that cardiovascular rhythm disruption can be associated with ICU-related delirium and deep sedation.

8

Outcome Prediction Models for Critically Ill Patients Using Small Routine Laboratory Datasets

Cao, X.; Hou, J.; Wei, X.; Wang, Q.

2026-04-27 emergency medicine 10.64898/2026.04.26.26351758 medRxiv

Top 0.7%

2.2%

We present a suite of foundational outcome prediction models for critically ill patients, developed using readily available, routine blood tests and advanced machine learning techniques. The input data of the models include complete blood counts (CBCs), metabolic panels, and additional biomarkers that assess liver and kidney function, coagulation status, and cardiac injury. The output is the predicted outcome at a given future horizon. For diagnoses, the length of the future horizon is set to zero, while it is set to a fixed time interval for prognoses. The training dataset in this study comprises clinical data from 332 ICU patients, augmented with 200 synthetic samples generated via a conditional diffusion model. Generative machine learning-based data imputation and augmentation approaches yielded modest gains in predictive accuracy. However, substantial performance improvements were achieved through additional methods, including dimensionality and order reduction, SHAP-based feature importance analysis, and a novel time-series-to-image encoding strategy that enables the use of image-based classifiers for temporal clinical data. Principal component analysis-based order reduction produced measurable gains in outcome prediction, while the time-series-to-image encoding proved particularly effective in mitigating small-data limitations common in clinical research. Across all evaluation metrics (accuracy, precision, recall, F1 score, and AUROC), the prognostic models achieved performance exceeding 85%, with some models attaining AUROC scores above 90%. We developed a new model ensemble approach to optimize the predicted outcome. This ensemble modeling approach improves the overall prediction, pushing all assessment metrics over 90%.
This work establishes a robust and interpretable AI-enabled diagnostic and prognostic toolkit for outcome predictions in critically ill patients and demonstrates a scalable workflow for developing high-performing models from sparse healthcare datasets. The proposed framework is readily deployable in ICU environments with routine blood testing capabilities and serves as a foundation for future integration into digital twin systems for critical care.

9

Generalizing intensive care AI across time scales in resource-limited settings

Devadiga, A.; Singh, P.; Sankar, J.; Lodha, R.; Sethi, T.

2026-04-24 health informatics 10.64898/2026.04.23.26351588 medRxiv

Top 0.8%

2.1%

Temporal resolution of physiological monitoring in intensive care varies widely across healthcare systems. Artificial intelligence models assume a uniform and fixed frequency of sampling, thus limiting the generalizability of models, especially to resource-limited settings. Here, we propose a novel resolution-transfer task for physiological time series and ask whether models trained on high-resolution data can generalize to a low data-density setting without the need to retrain them. SafeICU, a novel longitudinal pediatric intensive care dataset spanning ten years from a tertiary care hospital in India, was used to test this hypothesis. Self-supervised transformer models were trained on 144,271 patient-hours of high-resolution physiological signals from 984 pediatric ICU stays to learn representations of heart rate, respiratory rate, oxygen saturation, and arterial blood pressure. Transfer of this model to low-resolution data established robust performance in clinically relevant lower-frequency intervals, consistently outperforming models trained directly at coarser resolutions. Further, these representations generalized across patient populations, maintaining performance when evaluated on adult intensive care cohorts from the MIMIC-III and eICU databases without retraining. In a downstream task of early shock prediction, models achieved strong discrimination in the pediatric cohort (area under the receiver operating characteristic curve (AUROC) 0.87; area under the precision-recall curve (AUPRC) 0.92) and retained stable performance across monitoring intervals from 10 to 60 minutes (AUROC 0.78-0.88). Together, these results demonstrate that physiological representations learned from high-resolution data enable time-scale-robust and transferable AI for intensive care. 
The publicly released SafeICU dataset, comprising longitudinal vital signs, laboratory measurements, treatment records, microbiology, and admission and discharge, provides a foundation for developing and deploying generalizable clinical AI in resource-limited settings.

10

Wearable Dual-Modality Plethysmography for Arterial Modulation and Blood Pressure Dip

Jung, S.; Thomson, S.

2026-04-21 physiology 10.64898/2026.04.17.719282 medRxiv

Top 0.9%

1.9%

Continuous, non-invasive cardiovascular monitoring is limited by the superficial sensing depth of photoplethysmography (PPG), which is susceptible to peripheral artifacts. This study evaluates a wearable dual-modality prototype integrating dry-electrode impedance plethysmography (IPG) and PPG within a smartwatch form factor. Results from a pilot study (N=2) demonstrate that IPG signals exhibit a temporal lead over PPG across ventral and dorsal sites, supporting its greater penetration depth. During brachial artery modulation, IPG showed superior sensitivity to arterial recovery on the ventral forearm. Furthermore, 60-minute napping sessions revealed that while PPG remained morphologically stable, IPG signals underwent significant evolution, capturing distinct pulse-wave archetypes. These findings suggest that wearable IPG provides a high-fidelity window into deep systemic hemodynamics typically reserved for clinical instrumentation.

11

MedSafe-Dx (v0): A Safety-Focused Benchmark for Evaluating LLMs in Clinical Diagnostic Decision Support

Van Oyen, C.; Mirza-Haq, N.

2026-04-21 health informatics 10.64898/2026.04.14.26350711 medRxiv

Top 0.9%

1.8%

We introduce MedSafe-Dx (v0), a new safety-focused benchmark for evaluating large language models in clinical diagnostic decision support using a filtered subset of the DDx Plus dataset (N=250). MedSafe-Dx evaluates three dimensions: escalation sensitivity, avoidance of false reassurance, and calibration of uncertainty. Models were tasked with providing a ranked differential (ICD-10), an escalation decision (Urgent vs. Routine), and a confidence flag. Performance was measured via a "Safety Pass Rate," a composite metric penalizing three hard failure modes: missed escalations of life-threatening conditions, overconfident incorrect diagnoses, and unsafe reassurance in ambiguous cases. Eleven models were evaluated, revealing a significant disconnect between diagnostic recall and safety. GPT-5.2 achieved the highest Safety Pass Rate (97.6%), while several models exhibited high rates of missed escalations or unsafe reassurance. MedSafe-Dx provides a robust stress test for identifying high-risk failure modes in diagnostic decision support and shows that high diagnostic accuracy does not guarantee clinical safety. While the benchmark is currently limited by synthetic data and proxy labels, it provides a reproducible, auditable framework for testing AI behavior before clinical deployment. Our findings suggest that interventions such as safety-focused prompting and reasoning-token budgets could be essential components for the safe deployment of LLMs in clinical workflows.
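
One plausible reading of a composite "Safety Pass Rate" is the fraction of cases in which none of the three hard failure modes occurs. The sketch below implements that reading only; the class, field names, and toy data are hypothetical, and the benchmark's actual scoring rules may weight or define failures differently.

```python
from dataclasses import dataclass


@dataclass
class CaseResult:
    """Per-case failure flags (hypothetical field names)."""
    missed_escalation: bool      # life-threatening condition not escalated
    overconfident_wrong: bool    # incorrect diagnosis flagged as confident
    unsafe_reassurance: bool     # reassurance given in an ambiguous case


def safety_pass_rate(results: list[CaseResult]) -> float:
    """Fraction of cases with no hard failure (assumed all-or-nothing per case)."""
    if not results:
        raise ValueError("results must not be empty")
    passed = sum(
        1 for r in results
        if not (r.missed_escalation or r.overconfident_wrong or r.unsafe_reassurance)
    )
    return passed / len(results)


# Toy example: one of four cases triggers a hard failure.
cases = [
    CaseResult(False, False, False),
    CaseResult(False, False, False),
    CaseResult(True, False, False),
    CaseResult(False, False, False),
]
print(f"Safety Pass Rate = {safety_pass_rate(cases):.1%}")
```

Under an all-or-nothing rule like this one, a model can rank the correct diagnosis highly and still fail a case, which is consistent with the abstract's point that diagnostic accuracy and safety can diverge.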

12

BridgeBP: A Toolbox for Bridging Brain Parcellations and Standardizing Structural Connectivity Matrices

Zhang, Z.; Liu, A. H.; Zhang, Z.

2026-04-21 neuroscience 10.64898/2026.04.17.718823 medRxiv

Top 0.9%

1.8%

Brain network analysis has emerged as a critical framework for understanding the complex organization and function of the human brain, underpinning insights into cognition, behavior, and neuropsychiatric conditions. Central to this approach is the parcellation of the brain into discrete regions, which simplifies high-dimensional connectome data and facilitates the investigation of network architectures. However, the proliferation of brain parcellation schemes introduces significant challenges: different parcellations often yield varying network sizes and measures, complicating cross-study comparisons and the reproducibility of findings. Moreover, most connectome construction pipelines are rigid, typically outputting connectivity matrices from only one or a few parcellation schemes, which limits flexibility. In this paper, we address these issues by introducing BridgeBP, a novel toolbox designed to bridge brain parcellations by leveraging continuous brain connectivity concepts. BridgeBP transforms structural connectivity matrices derived from one parcellation scheme into matrices corresponding to more than 40 alternative schemes, standardizing analyses and enhancing the robustness of network studies. Through extensive evaluations, we demonstrate that BridgeBP enables consistent network comparisons across diverse parcellation frameworks, paving the way for more reproducible and generalizable insights in brain connectome research.

13

Structure-aware graph attention based hierarchical transformer framework for drug-target binding affinity prediction

Kaira, V. S.; Kudari, Z. D.; P, S. S.; Bhat, R.; G, J.

2026-04-22 bioinformatics 10.64898/2026.04.19.719524 medRxiv

Top 1%

1.7%

Drug-target interaction prediction is significant in the hit identification phase of drug discovery, enabling the identification of potential drug candidates for downstream optimization. Traditional computational methods are limited in their ability to represent 3D structural data for both molecules and target proteins, which is required to capture the intricate protein-ligand interactions that regulate binding affinity. In this work, we propose a graph transformer-based model (GTStrDTI) that combines an intragraph attention mechanism with cross-modal attention to enrich the representation of both the drug molecule and target protein. This approach comprehensively models both intramolecular structural features and intermolecular interactions, thereby enhancing binding affinity prediction performance. A thorough evaluation on benchmark datasets such as KIBA, DAVIS, and BindingDB_Kd shows that our approach surpasses the state-of-the-art methods under challenging target cold-start settings. Our analysis found that augmenting the input with a graph-based 3D structural representation of the protein target (C-alpha contact graphs from PDB with a threshold distance of 5 Å) and incorporating molecule adjacency information boosts predictive performance, thus contributing towards narrowing the gap between computational and experimental research.

14

Graph-Based Synthetic EHR Generation with Improved Quality-Privacy Trade-offs for Opioid Use Disorder Prediction

Alam, M. A. U.; Shalhout, S. Z.

2026-04-27 pain medicine 10.64898/2026.04.24.26351704 medRxiv

Top 1%

1.5%

Electronic health record (EHR) data are critical for clinical research but are challenging to share due to privacy and re-identification risks, particularly in sensitive domains such as opioid use disorder (OUD). Synthetic data generation offers a promising alternative; however, existing methods often struggle to preserve complex multivariate dependencies while maintaining a strong balance between data utility and privacy. The recently proposed MIIC-SDG framework leverages multivariate information theory and Bayesian network modeling to capture dependency structures and introduces Quality-Privacy Scores (QPS) to evaluate this trade-off, yet its capacity to model nonlinear relationships and support multi-task predictive settings remains limited. In this work, we propose a multi-task extension of TabGraphSyn, a graph-based generative framework for privacy-preserving EHR synthesis. The method constructs patient similarity graphs from high-dimensional tabular data and learns topology-aware embeddings via a graph convolutional network, which are then incorporated into a conditional variational autoencoder for synthetic data generation. Unlike prior approaches, our framework jointly models multiple clinically relevant OUD targets, including 180-day opioid abuse outcome, opioid concept group, and opioid source concept group, enabling preservation of label-dependent relationships across tasks. We evaluate TabGraphSyn against MIIC-SDG under a unified framework including multi-task predictive utility, distributional similarity, identifiability risk, membership inference risk, and QPS-based metrics. Results on the NIH All of Us dataset show that TabGraphSyn achieves a stronger overall utility-privacy balance, outperforming MIIC-SDG in most headline metrics, including higher synthetic multi-task ROC-AUC (0.5278 vs 0.4932) and MetaQPS (AM: 0.0215 vs 0.0115; HM: 0.0391 vs 0.0223), while slightly underperforming in macro F1 (0.2321 vs 0.2840).
These findings demonstrate improved modeling of nonlinear dependencies and more favorable quality-privacy trade-offs in multi-task settings, supporting its use for realistic and privacy-aware synthetic EHR data generation.

15

Aligned recordings of neural spiking activity and licking behavior in thirsty mice

Xu, Z.; Hong, B.; Li, L.; Xie, T.; Chen, Z.; Yao, H.; Zhang, T.

2026-04-23 neuroscience 10.64898/2026.04.21.720009 medRxiv

Top 1%

1.2%

Electrophysiological data, which serve as a biological signal that bridges neural activity and behavioral tasks, provide an innovative approach to neuroscience research. In this study, we constructed a dataset of over 2000 neurons recorded across 117 days in 20 mice, comprising 28,573 trials. Data for 5 mice were collected from the Secondary Motor Cortex (M2), data for 8 mice from the Ventrolateral Striatum (VLS), and data for 7 mice from the Substantia Nigra pars Reticulata (SNR). We induced licking behavior in head-fixed mice by periodically delivering water through a spout while simultaneously recording spiking activity from three brain regions and behavior-related electrical signals. This dataset ensures precise temporal alignment between neural activity and behavioral events, offering a robust foundation for investigating neural encoding mechanisms and simulating neural activity. It establishes a precise spike-to-event mapping, which enables high decoding accuracy using a Multilayer Perceptron (MLP) and a Support Vector Machine (SVM). It can serve as a high-quality benchmark for developing encoding and decoding algorithms in neural networks, particularly Spiking Neural Networks (SNNs).

16

Large language models and retrieval augmented generation for complex clinical codelists: evaluating performance and assessing failure modes

Matthewman, J.; Denaxas, S.; Langan, S.; Painter, J. L.; Bate, A.

2026-04-24 health informatics 10.64898/2026.04.23.26351098 medRxiv

Top 2%

0.9%

Objectives: Large language models (LLMs) have shown promise in creating clinical codelists for research purposes, a time-consuming task requiring expert domain knowledge. Here, we evaluate the performance and assess failure modes of a retrieval augmented generation (RAG) approach to creating clinical codelists for the large and complex medical terminology used by the Clinical Practice Research Datalink (CPRD). Materials & Methods: We set up a RAG system using a database of word embeddings of the medical terminology that we created using a general-purpose word embedding model (gemini-embedding). We developed 7 reference codelists presenting different challenges and tagged required and optional codes. We ran 168 evaluations (7 codelists, 2 different database subsets, 4 models, 3 epochs each). Scoring was based on the omission of required codes and the inclusion of irrelevant codes. We used model-grading (i.e., grading by another LLM with the reference codelists provided as context) to evaluate the output codelists (a score of 0% being all incorrect and 100% being all correct). Results: We saw varying accuracy across models and codelists, with Gemini 3 Pro (score 43%) generally performing better than Claude Sonnet 4.6 (36%) and Gemini 3 Flash, and OpenAI GPT 5.2 performing worst (14%). Models performed better with shorter target codelists (e.g., Eosinophilic esophagitis with four codes, and Hidradenitis suppurativa with 14 codes). In contrast, all models consistently failed to produce a complete Wrist fracture codelist (with 214 required codes). We further present evaluation summaries and failure mode evaluations produced by parsing LLM chat logs. Discussion: Besides demonstrating that a single-shot RAG approach is currently not suitable for codelist generation, we demonstrate failure modes including hallucinations, retrieval failures, and generation failures where retrieved codes are not used.
Conclusions: Our findings suggest that while RAG systems using current frontier LLMs may create correct clinical codelists in some cases, they still struggle with large and complex terminologies and codelists with a large number of codes. The failure modes we highlight can inform the creation of future workflows to avoid failures.

17

Leveraging Open-Source Solutions to Build a Low-Cost Digital Pathology Pipeline for Translational Research

Stenberg, J.; Gullapalli, A.; Foucar, K.; Babu, D.; Redemann, J.; Joste, N.; Foucar, C.; Gratzinger, D.; George, T.; Ohgami, R.; Gullapalli, R. R.

2026-04-27 pathology 10.64898/2026.04.25.26350240 medRxiv

Top 2%

0.8%

Digital Pathology (DP) is a fast-emerging branch of pathology focused on digitizing pathology data. Key challenges of DP adoption for pathology laboratories, especially small to mid-sized clinical labs, are the upfront costs associated with instrumentation and the logistical difficulties of implementation. In the current project, we built an end-to-end DP solution using low-cost, open-source components that is user-friendly at a small scale. We repurposed readily available microscopy components in a pathology lab to assemble a fully functional DP pipeline for translational research applications. We tested multiple low-cost complementary metal-oxide semiconductor (CMOS) cameras in this project and chose a user-friendly Canon camera for image acquisition. An open-source DP server solution, OMERO v.5.6.4, was used as the image management system (IMS) to host and serve the whole-slide images (WSIs) on an Ubuntu 22.04 operating system. The server-hosted WSIs were evaluated remotely and asynchronously by multiple pathologists physically situated in Albuquerque, NM; Salt Lake City, UT; and Palo Alto, CA. Each pathologist assessed the quality of the WSI pipeline, image quality, and WSI interaction experience using a 23-question survey. Overall, the pathologists rated the custom, low-cost WSI pipeline as robust and user-friendly. The current DP setup is unlikely to be useful as a commercial, scalable DP pipeline for large-scale clinical applications. However, it demonstrates the feasibility of creating customized, small-scale DP solutions (at a low price point) for asynchronous translational pathology research applications. Additionally, building customized DP pipelines provides excellent educational opportunities for pathology residents to gain in-depth knowledge of the various technical elements of a DP workflow.
In summary, we have established a low-cost, end-to-end WSI DP pipeline useful for spatiotemporally asynchronous translational pathology research in an academic setting.

18

Assessing Parent-cocreated Sensory Reactivity Outcomes in Children with Neurodevelopmental Disorders Undergoing Bumetanide Treatment: A Multiple-Baseline Single-Case Experimental Design

Geertjens, L. L. M. G.; Cristian, G.; Ramautar, J. J. R.; Haverman, L.; Schalet, B. B. D.; Linkenkaer-Hansen, K.; van der Wilt, G.-J.; Sprengers, J. J. J.; Bruining, H.

2026-04-23 psychiatry and clinical psychology 10.64898/2026.04.22.26351464 medRxiv

Top 2%

0.8%

Progress in pharmacological treatment development for neurodevelopmental disorders is hindered by a misalignment between targeted mechanisms, outcome measures, and trial designs. This study was initiated as a post-trial access pathway for bumetanide and later expanded with treatment-naive participants. Within this framework, we implemented a parent-cocreated sensory outcome measure set (PROMset) in an unmasked, multiple-baseline single-case experimental design with randomized baseline periods of 2-12 weeks, followed by 6 months of bumetanide treatment (up to 1.5 mg twice daily). Participants (7-19 years) had atypical sensory reactivity and a diagnosis of ASD, ADHD, epilepsy, or TSC. The primary outcome was a PROMset comprising seven PROMIS item banks assessing anxiety, depressive symptoms, sleep disturbance, fatigue, sleep-related impairment, cognitive function, and peer relationships. Secondary outcomes included SSP, SRS-2, RBS-R, and ABC. Of 113 enrolled participants (mean age 13.2 [SD 2.7], 64% male), 102 completed the trial and 95 had analyzable PROMsets. At baseline, PROMset scores showed substantial impairment across domains (mean deviation = 9.0 T-score points, p < .001) and correlated with sensory reactivity (SSP; r = -0.40, p < .001). Individual-level analyses showed improvement in 24-41% of participants per PROM domain, most frequently in anxiety and depressive symptoms (41% and 38%; mean across-case Cohen's d = -1). Overall, 83% improved on at least one domain. Group-level analyses showed improvement across all secondary outcomes (p < .001), with superiority over historical placebo for RBS-R and SSP. Integrating PROMsets with individualized trial designs can reveal clinically meaningful changes, supporting a more sensitive and patient-centered framework for treatment evaluation in heterogeneous populations.
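
As an illustration of the individual-level effect sizes mentioned above, the following is a minimal sketch of a per-participant Cohen's d between baseline-phase and treatment-phase scores. The abstract does not state which estimator the authors used; the pooled-standard-deviation form, the function name `cohens_d`, and the example T-scores are assumptions made for illustration only.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(baseline, treatment):
    """Cohen's d for one participant: treatment phase minus baseline phase.

    Assumption: pooled sample standard deviation across the two phases.
    A negative value means scores were lower during treatment.
    """
    n1, n2 = len(baseline), len(treatment)
    if n1 < 2 or n2 < 2:
        raise ValueError("each phase needs at least two observations")
    pooled_var = ((n1 - 1) * variance(baseline)
                  + (n2 - 1) * variance(treatment)) / (n1 + n2 - 2)
    return (mean(treatment) - mean(baseline)) / sqrt(pooled_var)


# Made-up anxiety T-scores for a single hypothetical participant.
baseline_scores = [62, 60, 64]
treatment_scores = [58, 56, 60]
d = cohens_d(baseline_scores, treatment_scores)
```

For a symptom domain such as anxiety, where higher T-scores indicate more symptoms, a negative d corresponds to improvement under this sign convention.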

19

Assessing ageing, cognitive ability and freezing of gait in Parkinson's disease through integrated brain-heart network dynamics

Pitti, L.; Sitti, G.; Candia-Rivera, D.

2026-04-23 neurology 10.64898/2026.04.22.26351482 medRxiv

Top 2%

0.7%

Parkinson's Disease (PD) is a complex neurodegenerative disorder that manifests through systemic, large-scale physiological reorganizations. While research often focuses on region-specific neural changes, there is a growing need for multidomain approaches to capture the complexity of the disease and its clinical heterogeneity. This study proposes an analytical pipeline to evaluate Brain-Heart Interplay (BHI) as a novel systemic biomarker for neurodegeneration and healthy ageing. We assessed BHI across three open-source datasets (EEG and ECG signals). We compared Healthy Young, Healthy Elderly, and PD patients in resting state to investigate the effects of ageing and cognitive performance. Additionally, we studied BHI trends in PD patients at the moment of freezing of gait (FOG). Methodologically, brain network organization was quantified using coherence-based EEG connectivity and graph theory, while heart activity was analyzed through Poincaré plot-derived measures of cardiac autonomic activity. The coupling between these two systems was measured using the Maximal Information Coefficient to capture linear and non-linear dependencies between global cortical organization and cardiac autonomic outflow. The results demonstrate that BHI is a sensitive biomarker for detecting early multisystem dysfunction in both neurodegeneration and ageing. Furthermore, the identification of specific BHI trends during FOG onset suggests new opportunities for understanding the physiological mechanisms driving motor complications in PD. Our proposed pipeline provides a guiding tool for large-scale physiological assessment in clinical research.
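
The abstract does not say which Poincaré plot measures were used. As a hedged illustration, the sketch below computes the two standard descriptors, SD1 (short-term variability, across the identity line) and SD2 (longer-term variability, along it), from a sequence of RR intervals. The function name `poincare_sd`, the use of the sample standard deviation, and the example intervals are assumptions, not details from the study.

```python
from math import sqrt
from statistics import stdev


def poincare_sd(rr_ms):
    """Return (SD1, SD2) of the Poincaré plot for RR intervals in milliseconds.

    Each point of the plot is (RR_n, RR_n+1). Rotating the points by 45 degrees
    gives one coordinate across the identity line and one along it; SD1 and SD2
    are the standard deviations of those two coordinates.
    """
    if len(rr_ms) < 3:
        raise ValueError("need at least three RR intervals")
    pairs = list(zip(rr_ms[:-1], rr_ms[1:]))
    across = [(b - a) / sqrt(2) for a, b in pairs]  # perpendicular to identity line
    along = [(a + b) / sqrt(2) for a, b in pairs]   # parallel to identity line
    return stdev(across), stdev(along)


# Made-up RR intervals (ms) for illustration.
sd1, sd2 = poincare_sd([812, 790, 805, 830, 798])

# A steadily lengthening series has no beat-to-beat scatter, so SD1 is zero.
ramp_sd1, ramp_sd2 = poincare_sd([800, 810, 820, 830])
```

Descriptors of this kind could then be paired with a global EEG network metric per analysis window before estimating their statistical dependence; that pairing step is not shown here.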

20

Nanopore Whole-Genome Sequencing for Rapid, Comprehensive Molecular Diagnostics of Brain Tumors in Adult Patients

Halldorsson, S.; Nagymihaly, R. M.; Bope, C. D.; Lund-Iversen, M.; Niehusmann, P.; Lien-Dahl, T.; Pahnke, J.; Bruning, T.; Kongelf, G.; Patel, A.; Sahm, F.; Euskirchen, P.; Leske, H.; Vik-Mo, E. O.

2026-04-24 pathology 10.64898/2026.04.23.26351563 medRxiv

Top 2%

0.7%

Background: Classification of central nervous system (CNS) tumors has become increasingly complex, raising concerns about the sustainability of comprehensive molecular diagnostics. We have evaluated nanopore whole genome sequencing (nWGS) as a single workflow to replace multiple diagnostic assays. Methods: We performed nWGS on DNA extracted from 90 adult CNS tumor samples (58 retrospective, 32 prospective) and compared the results to findings from standard of care (SoC) diagnostic work-up. Analysis was done through an automated workflow that consolidated diagnostically and therapeutically relevant genomic alterations, including copy-number variations, structural and single-nucleotide variants, chromosomal aberrations, gene fusions, and methylation-based classification. Results: nWGS supported final diagnostic classification in all samples with >15% tumor cell content, with ~3 hours of hands-on library preparation, parallel sample processing, and sequencing completed within 72 hours. Methylation-based classification was available within 1 hour and was concordant with the integrated final diagnosis in 89% of cases (80/90). All diagnostically relevant copy-number variations, single-nucleotide variants, and gene fusions were concordant with SoC testing. MGMT promoter methylation status matched in 94% of cases. In addition, nWGS identified prognostic and potentially actionable variants that were not reported or covered by SoC. Conclusions: nWGS delivers comprehensive genetic and epigenetic results with a fast turn-around compared to standard methods. This enables efficient, accurate, and scalable molecular diagnostics of CNS tumors using a single platform. These data support its implementation in routine clinical practice, and the approach may be extended to other cancer types requiring complex genomic profiling.